{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "Python 3.6.7 64-bit",
   "display_name": "Python 3.6.7 64-bit",
   "metadata": {
    "interpreter": {
     "hash": "07829f70181eff8a51ba0c91cec5a602c5da58bbdadabc96e22cfe4f56b0b078"
    }
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tools import *\n",
    "from func import *\n",
    "%matplotlib inline"
   ]
  },
  {
   "source": [
    "# Chap1 语言处理与Python\n",
    "\n",
    "目的：\n",
    "\n",
    "1.  简单的程序如何与大规模的文本结合？\n",
    "2.  如何自动地提取出关键字和词组？如何使用它们来总结文本的风格和内容？\n",
    "3.  Python为文本处理提供了哪些工具和技术？\n",
    "4.  自然语言处理中还有哪些有趣的挑战？"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## 1.1  语言计算：文本和词汇\n",
    "\n",
    "### 1.1.1 Python 入门"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "8"
      ]
     },
     "metadata": {},
     "execution_count": 2
    }
   ],
   "source": [
    "# Python 是解释型语言，立刻执行\n",
    "1+5*2-3"
   ]
  },
  {
   "source": [
    "### 1.1.2 NLTK 入门"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "*** Introductory Examples for the NLTK Book ***\n",
      "Loading text1, ..., text9 and sent1, ..., sent9\n",
      "Type the name of the text or sentence to view it.\n",
      "Type: 'texts()' or 'sents()' to list the materials.\n",
      "text1: Moby Dick by Herman Melville 1851\n",
      "text2: Sense and Sensibility by Jane Austen 1811\n",
      "text3: The Book of Genesis\n",
      "text4: Inaugural Address Corpus\n",
      "text5: Chat Corpus\n",
      "text6: Monty Python and the Holy Grail\n",
      "text7: Wall Street Journal\n",
      "text8: Personals Corpus\n",
      "text9: The Man Who Was Thursday by G . K . Chesterton 1908\n"
     ]
    }
   ],
   "source": [
    "# 不知为何国内用不起，只能自己下载数据文件，解压到 C:\\nltk_data 目录下\n",
    "# 数据下载：https://pan.baidu.com/s/1rDpWeknm13dyoyjsqu7zFg 提取码：eq5x\n",
    "\n",
    "# nltk.download()\n",
    "from nltk.book import *"
   ]
  },
  {
   "source": [
    "### 1.1.3 搜索文本"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "=============== >搜索单词 monstrous，显示其所在的上下文< ===============\n",
      "Displaying 11 of 11 matches:\n",
      "ong the former , one was of a most monstrous size . ... This came towards us , \n",
      "ON OF THE PSALMS . \" Touching that monstrous bulk of the whale or ork we have r\n",
      "ll over with a heathenish array of monstrous clubs and spears . Some were thick\n",
      "d as you gazed , and wondered what monstrous cannibal and savage could ever hav\n",
      "that has survived the flood ; most monstrous and most mountainous ! That Himmal\n",
      "they might scout at Moby Dick as a monstrous fable , or still worse and more de\n",
      "th of Radney .'\" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l\n",
      "ing Scenes . In connexion with the monstrous pictures of whales , I am strongly\n",
      "ere to enter upon those still more monstrous stories of them which are to be fo\n",
      "ght have been rummaged out of this monstrous cabinet there is no telling . But \n",
      "of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u\n"
     ]
    }
   ],
   "source": [
    "show_title(\"搜索单词 monstrous，显示其所在的上下文\")\n",
    "text1.concordance('monstrous')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "=============== >搜索与 monstrous 具有相同上下文的单词< ===============\n",
      "--------------- >text1< ---------------\n",
      "true contemptible christian abundant few part mean careful puzzled\n",
      "mystifying passing curious loving wise doleful gamesome singular\n",
      "delightfully perilous fearless\n",
      "--------------- >text2< ---------------\n",
      "very so exceedingly heartily a as good great extremely remarkably\n",
      "sweet vast amazingly\n"
     ]
    }
   ],
   "source": [
    "show_title(\"搜索与 monstrous 具有相同上下文的单词\")\n",
    "show_subtitle(\"text1\")\n",
    "text1.similar('monstrous')\n",
    "show_subtitle(\"text2\")\n",
    "text2.similar('monstrous')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "=============== >搜索 monstrous 与 very 两个单词的相同上下文< ===============\nam_glad a_pretty a_lucky is_pretty be_glad\n"
     ]
    }
   ],
   "source": [
    "show_title(\"搜索 monstrous 与 very 两个单词的相同上下文\")\n",
    "text2.common_contexts(['monstrous','very'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "output_type": "display_data",
     "data": {
      "text/plain": "<Figure size 432x288 with 1 Axes>",
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZkAAAESCAYAAAAv0qjVAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAH49JREFUeJzt3X2UHVWZ7/HvLwSIgKbhElFH0w2IoqCAtAoodCPiwlGGEReg+ELUMYwvaFyzVNAZ0sPogCNewwAqUYcwMlcFgRlBQZChCeHVDi+KIwrBoKjMRCDXGwF5e+4ftYuuVM5r99ndnfD7rHXWObVr17Of2qdOP13VlRNFBGZmZjnMmu4EzMxs0+UiY2Zm2bjImJlZNi4yZmaWjYuMmZll4yJjZmbZuMjYlJO0TNKHexzzdEkX9iDO3pLukvSApFM73GaZpDVpu1skvaO2fiCtf9Vk88uhV3PXIv6ApCfS/PxS0n9Iem5a1/V8p+2GJf1Trpytd1xkbFNxL7B6skEiYmVEvBD45y43/fu03VuBv5N0ZGXdw8AvgD9MNr9MejJ3bTyY5mcn4GfAaTCp+R5IsWyGc5GxGUPSZpJOkfRTSXdK+rvUvoOk30l6Y1o+TNLPJW2Tlu8APgIc0CDmotT3bknfqbT/SxpjlaTLJG3Xi32IiLuBU4D3pHE+CFwLvAJ4YS23Q9OZz52SVkvaIbUvk/QPkpZLukfSKZVtXidpLG3z/coZwQJJZ0n6atrXUUnPTOtenJbvSPH+ohKv4dxJekGKvyq9H4el9mFJl6ccV0u6VtK2XcxPABcBu7brK+kASTen/Vku6aXpGLkL+BxwSDoLukvS6zvNwaaWi4zNJAuArYHdgZcDfyFpKCL+GzgGOEvSjsAS4G0RsQ4gInYF3lEPJukIih/2r42InYBjK6v/JiJ2iYidgd8A7+/hfvyE9Ft2RHwp/aZ+Y4N+pwPvj4hdUv//qaw7EngnsBvwTkkvS4Xwi8AhaZurgH+sbHMMcEXa14eAI1L7CcAlEbFrRPQD3y83aDZ3wDfSNjsDbwOWSXpOWve6tG4AeBA4qoM5AUDSlhTvyaVt+m0LfBt4b9qfc4FzI+KJNJ+fBC6LiBemxw87zcGm1uzpTsCs4s+BVwMHp+VtKH77vzoiLpf0LeBmYCQibukg3uHAkohYAxAR91fWHSXpr4EtgXnAst7sAgBzgD910O+7wNclfQ34ZkT8vrLu9Ij4FYCk6yjOhAYoitEKSVB8fldXtrksIs5Lrz9dyeFS4AuSngX8W0T8vFVS6QxxH+AggIj4iaSbgVdRXPK7NSLKs8I7gB062Nc+ST+juHT4PeAzbfrvB/w0Im5Ny18HTpf0rIiYqZcdrQEXGZtJZgOfjohzmqzfiu7OvucAj9cbJe0GjAB7RsR9kkYoClqv7AXc3q5TRHxE0quBtwM/kXRYRNzUoOscYB3F/IxFxIFNQj61r9UiHBHflnQTxRnJDyR9PiLObJFaOcdPVtMFHm3SXy1ildZGxEs66FfarMH4TzK+j/7SxY2EL5fZTLIc+Ov0GzeSNpc0O70+Eng9sDdwgqR9O4g3CiysxCv/7vJs4AHgfyQNAG/p1Q5I2gv4OMWlsHZ9nxkRN0bEIuBKit/eS+V+7wq8ElgB3AQMStqvEmOrDsf5ZUScTHEGcXCr/ulM4WcUly+R9DLgpcD17cbqoZuAvSWVhemvgGsj4qG0vAbYVdIWKcdOCp1Nh4jww48pfVBcmnqQ4q6me4ELU/uWwBkUl4DuoDgb2AHYhaIovDr1OwL4NcVlrpcAd1H8XeWR9Pr41G9z4H8DvwTuBr6d2mcD/w7cA9wALAVOTesWpBgPpBzvAg7sYH/WpL7XAAdX1n05tT8M/Da93iqtuzPldTdwPtBXibc2ta8EXl+JdzjwU2AVRSH4q0re32mS37+mOVhFcRPCy1N7q7nbLfVdlXI4ILUPU5xNlbFPpbh82Wp+BoDfN1nXdL4piv/PUtulwAsq280GvpWOn7urc+7HzHoovWFmNkNIWkbxg/yM6c7FbLJ8uczMzLLxmYyZmWXjMxkzM8vGRcbMzLJ52v87me233z4GBgamOw0zs43KypUrfx8R89r1e9oXmYGBAcbGxqY7DTOzjYqkezrp58tlZmaWjYuMmZll4yJjZmbZuMiYmVk2LjJmZpaNi4yZmWXjImNmZtm4yJiZWTYuMmZmlo2LjJmZZeMiY2Zm2bjImJlZNi4yZmaWjYuMmZll4yJjZmbZuMiYmVk2LjJmZpaNi4yZmWXjImNmZtm4yJiZWTYuMmZmlo2LjJmZZeMiY2Zm2bjImJlZNi4yZmaWjYuMmZll4yJjZmbZuMiYmVk2LjJmZpaNi4yZmWXjImNmZtm4yJiZWTY9LTISh0ncWGvbRWKok74bm+FhGBkZfzRqKx8DA8VzX1/xXPYbHh6PV2+vbl9dhiJeo7Hq/fr6Gudez7lU9i/X1fevvm21vZE5c4r4s2aNjzMwsH7+9bjlXNUf9bHreZbqcct41eX6No32o942PLx+7Ha5leNUt20Uv1l7szlolUujGI32pYzRbA7q/ZuNU9dou3bHyKxZzcdvtm2z9vpcNmrvJGb5OW21Tav9ajZeo89rq/iN3tf657XRz5dmPw+mgyKid8HEbGDbCNZU2hYAAxGMtOs7HQYHB2NsbGxC20rrL0ds2NaJ8i3odNtOxyn7NXqLy+3rfcrX1edmebaKXx+nVe6d7n9122b51vNrttxsv+u5V9vqc9FsuVHM+rpmedTbG81Bq1xavY+N9q9Vvq1yavaeN5uzdsdIu+Ow3TjdtHfap5p3J8dHJ+M1e09bvT/1PBotd6KHP+rT2FoZEYPt+s2e+AC8GfgU8DhwAXAf8AFgID2QWAIcAsyRGAauj+AEiaMa9H0h8LUU/oXAN1LfAeArwFbAb4F3AX8GnAM8AOwEnBXBlySeC5wNbA7MiuDAie6fmZlN3oTOZCT6gBuAfSN4UEIRRFq3OqIoHGl5AQ3OZBr1TW39wIXAwRE8IHE+8PkIbpI4FVgJXA9cB+wNPAKsiGA3iXcC+0Xwwdb5ayGwEGD+/Pl733PPPV3PQcp1PT6TaT5Oq9x9JuMzGZ/JbLpnMhP9m8yLgNsieBCgLDCTJbEFcC5wbAQPpObdgX+SGAWGge1T+y8i+F3KYevUdh7woMQ1EidINHwLImJpRAxGxOC8efN6kbqZmTUw0ctlvwJeLjEngkeqZzINPARs12Hc04BzI6j+kWQVsDiClWVDuoS2gQgeBT4tMQu4FFgBXNPh2GZm1mMTKjIR3CdxGnCdxDrgfIk5wJuA56SzjsURXA1cASySuIbistYJEh+v9wXWAe9LMd8O3BTBJ4BPAl+SeBJ4DFhEUbg2IPGeFOMJYA2MF6YchoY2vJulURvAsmWwYAEsWQKLFsHoaNFvdHS8z+LF67dX49SX+/vH79CqqvebO7dx7osXr59zvX+5vtqv0baN1ldtuSXssw8sXw4HHDCeO2yYfxmrnKt2eTfLs/4e9PevH68cv91+1NvKeSpjN8ulPk5120b96+tazX23ubSKVZ+nVvPRSU6ttmulvOTTaPxm2zdrbzTP9fZOYs6dW3xOW23TyftT71ffpvz8dxq//BlRz7P+86Uav9kdd1Olp3eXbYwmc3eZmdnTVe6/yZiZmbXlImNmZtm4yJiZWTYuMmZmlo2LjJmZZeMiY2Zm2bjImJlZNi4yZmaWjYuMmZll4yJjZmbZuMiYmVk2LjJmZpaNi4yZmWXjImNmZtm4yJiZWTYuMmZmlo2LjJmZZeMiY2Zm2bjImJlZNi4yZmaWjYuMmZll4yJjZmbZuMiYmVk2LjJmZpaNi4yZmWXjImNmZtm4yJiZWTYuMmZmlk1XRUZiWGJZplwss5GR3sfqNubISG/zsOk1PFw8eqnZ8TEyAgMDxaM8jvr6iuUyl3Lbak6TzW8qjtdG81jdz/qj3KbU6PM4Uz5niojOO4thYEEEC3IlNNUGBwdjbGxsutOYEhJ08XZ3FKvbmFLx3Ks8bHrleD+bHVPlWI2Ux2L1dRljssd9Lz83rcaA9cfpZH/r+9jL/W5H0sqIGGzXr6MzGYkvSFwNnJmW95K4SuJaidNS24jEVyVukPiixBWpfQ+JKyWulviOxNapfR+J5an95NS2QOJMiYslbpPYP7V/OfW9TeI9qW0zidMlVkhcJ7GTxNESX63kfbvEDl3NnJmZ9czsdh0kDgG2jWCoPJOhKDZHRnBvKhz7pu7/BWwJXAPsmdq+Arw7gjslTgIWSiwBlgGHRLBaolqzB4GDgFcAx6RYx0XwuMR2wArgbOC9wKwIXpvyFPAb4LMSWwG7Ar+M4L833CctBBYCzJ8/v4NpMjOziWhbZIA9gBtqbS8Fzk2nc33Ac1L7Lan/zfBU4Xh2BHem19cBhwLzgP8bwWqACKondZdGsA5Ynh4An5R4AxDAVqltL+DicqMU408S/wEcTlHkzm60QxGxFFgKxeWydhNgZmYT08nlsvuAHdPrQ9PzKuDoCIYj2DOCi1ps/weJndPr/SkK0f3ADhLbw1NnIQ1J7Ai8FTiQ4symdDewX6VfGeOrqd9BwCXtd8/MzHLp5EzmQuDi9DeWG1Pbx4ALJB4BHgWObrH9R4BvSjxMUZxGInhC4mPAZRIPUVwC+1ST7X8LrAV+CNwBPJjavwKcI3FtWn4XcHcEP01/97kqgkc72L+njcWLex+r25i9zMGm39BQ72M2O0YWL4Zly4rXCxYUz0uWFHeYlbmUd1xV85psjlNxzDbKsb9/fD/bbdPo8zhTPmtd3V22MZDYhuLy3psiuKdd/6fT3WVmZr3S07vLNhYSi4ArgL/tpMCYmVlenVwu22hEsARYMt15mJlZYZM6kzEzs5nFRcbMzLJxkTEzs2xcZMzMLBsXGTMzy8ZFxszMsnGRMTOzbFxkzMwsGxcZMzPLxkXGzMyycZExM7NsXGTMzCwbFxkzM8vGRcbMzLJxkTEzs2xcZMzMLBsXGTMzy8ZFxszMsnGRMTOzbFxkzMwsGxcZMzPLxkXGzMyycZExM7NsXGTMzCwbFxkzM8vGRcbMzLLpeZGReLvETRLXSOyUIf7qXsecKn196y8PD4+/HhnJN+7ISPEox6uP1auxyzjVsZr1qffvNo9m+zIdyvkdGFi/rVX/iazrtF+zPNrF7mR99b0qXzc7juvHez2XaoxWcVoZHl5/7stjr3yMjMDs2evHHBgociv7DgwUz3PmNN6n+rFWza0+1/VcyudyvDJemdvw8Pi6aqxqvGafj0bxyjh9fcV+9/XBrFmN53uqPjuKiN4GFL8AXhHBup4GHo+/OoKBXsUbHByMsbGxXoVrSYLqdFeX6+t6PW4ponUekx2njF+O1axPvX+jvDodK9e8dao+v2Vbs7wmum4iMbo5xjpZDxu+x52OV19Xmkiu9ZzqedXVj7NWmu1To2Ou1T41Gq/d+PX5apZTPa9u9OpnjqSVETHYrl+Oy2WbVwuMxIjEKRKXStwq8eLU/gmJ5ems5+DUtp3ERRKjEt+T6EvtR0uskLi2Erdf4vup72USz5EYljhf4qcSn5X4icSzM+yjmZl1oGdFRuKjEqPAc9MP/lGJzdLqlwCHAl8AjpTYHRiK4ADgz4FTU78TgG9GMAxcCnxAYlvg48DrInhNZcjPA6emvuelbQG2SOM8E7gQiqK2fq5aKGlM0tiaNWt6MwFmZraB2b0KFMFpwGnpctZw2Z5O5S6K4HHgG6ntKGC3VJQAnpGedwf2l/ggMAe4GngR8LMIHq0NuTs8dWZzHfDW9PoW4EngZmAAnip0lVxjKbAUistlE9lfMzNrr2dFpkurgB8Dh0UQtfarIrigbJDoB3ZMrw+Fp/rfBewLjAL7UxQXMzObQaalyEQwJnE7cL3EH4EbI/gUcDJwtsRxwGPAZyMYTX/LWQ7cAJTXt44Hlko8kdreB+w15TvThblz118eGhp/vXhxvnHL2KOjjcfq1dhlnMWLx8dq1qfev9s8yrnLOW+dKnNYtmzDtlb9u13Xab/+/sb92sXuZn31dbPjuH68t8tlIp+HoaHizqpy7uvH3vAwfOYz68fs74e1a2HRoqLv6tXFHVk33ADHH79hLvVjrZpbo7mu5lI+L1lSjFeNNzxcjH/rrcW66vFTH6fVcjVeOf7atbBuHWyzDfzhD3DiiWxgqj47Pb+7bGMzlXeXmZltKqbz7jIzMzPARcbMzDJykTEzs2xcZMzMLBsXGTMzy8ZFxszMsnGRMTOzbFxkzMwsGxcZMzPLxkXGzMyycZExM7NsXGTMzCwbFxkzM8vGRcbMzLJxkTEzs2xcZMzMLBsXGTMzy8ZFxszMsnGRMTOzbFxkzMwsGxcZMzPLxkXGzMyycZExM7NsXGTMzCwbFxkzM8vGRcbMzLJxkTEzs2xcZMzMLBsXmUkYGWm/vuxTfwYYHh7vU+0LMDBQrG82ZrOx63H6+tZfrm5fxi+f69s20yivZlrlCcV+NlLNu8yrbGsUs5pTo/X1tlb72WxduzHaxanOc/lcb2sWp9373miMRvGbzXejMbvV6TYTid3rHDbV8aH7YyU3RcTUjARIDAMLIljQZP3ewJMR3FJr/wiwfwRH9DqnwcHBGBsbm9C2ErSaPql4jhjvW92mXF9VX1eP3yhOszGbLZfbN8qt0ZiNxuj0sGmVZ6f70WqeGo3TKGa9rdU+tMu53fbtxmz1HjSL00m/+hilZsdfN7l3ImfsXueQy3SPX82h02Nl4uNoZUQMtus3085kDgX2aNC+FFg4xbmYmdkkTUmRkfiCxNXAmWl5dWXdqMSAxLeABcDxqe2jaf1HgVHgolrML6Z+10nsmdp2k7hS4iqJbzbPRwsljUkaW7NmTW931szMnpK9yEgcAmwbwRDwoWb9IngbsAw4JYLhCE5L7acBb6vFfBOwZQTDFGc4n0ur3gL8IIIDI3h787FiaUQMRsTgvHnzJr5zZmbW0lScyewB3NDjmLsDr5MYBc4A5qT2M4D56QzHl9fMzKbZ7CkY4z5g1/T60PT8R4mtgc2BHSt9HwK26yDmKuCKCI6rNkawFviwxBbAzRKXRvDrSWXfwuLFna8vX1fbhoaa3+HR39/4TqBGcVrlNHcuLFrUePvR0fE8WsWsK/t3ol2e/f2N11fzLvsuWbL+vjTLqdGY9bZW+9psXbsx2sWpz3P1Peg0n3bj1seox282363G7FSn20wkdq9z2FTHr+bQ6bGSW/a7yySeCVwMPAbcCDwfuIWi4KyhKDJvi2C1xM7A+cD/Ay6M4DSJLwCvoShUtwLHAndR3AywK/Aw8N0I/lni08AhQAC/ABZG8GSr/CZzd5mZ2dNVp3eXTektzDORi4yZWfc21luYzcxsE+IiY2Zm2bjImJlZNi4yZmaWjYuMmZll4yJjZmbZuMiYmVk2LjJmZpaNi4yZmWXjImNmZtm4yJiZWTYuMmZmlo2LjJmZZeMiY2Zm2bjImJlZNi4yZmaWjYuMmZll4yJjZmbZuMiYmVk2LjJmZpaNi4yZmWXjImNmZtm4yJiZWTYuMmZmlo2LjJmZZeMiY2Zm2bjImJlZNi4yZmaWzUZRZCQukPjQdOfRyvAwjIwUr0dGxh9ToTpur2NuKqZ6fzam+evr667/8HDxPDBQPKrLw8Pjj2rf+nxU+1TbOvnsNIpV5tJKqxzq21ZzqfatL8+aVbTNmlXMYz336v7Xf0aUy83ma3gYZs8u4pbP9W1mzy5y7+sbH39gYLx/OS9z5mw4Rrv56hVFRO+CiW2A3wBfjuD4HsbdFng4gkd6FbM0ODgYY2Njk44jFc8R46/L5dyk8XF7NV4vY80EU70/G9P8dZtr9Xgr1ZerbY2Ozernpd5W3b6TfDv9vLXKoV3MdvtaV8bqdK7qebSLP1Gt3pNuSVoZEYPt+vX6TOYvga8Dh0tI4ocSP5L4vMRPJN5YJMcnJJZL3CRxcGpbIHGmxMUSt0nsn9rPBq6H8aIlsZnE6RIrJK6T2Cm1fznFvU3iPT3eNzMz69LsHsd7B/BxYFdgX2A34GDgG8DHgH0lfg0MRXCAxPbAlcAeaftB4CDgFcAxwDURvEdiATBQGee9wKwIXgsgUdb94yJ4XGI7YAVwdqMkJS0EFgLMnz+/B7ttZmaN9KzISMwD+iO4XeIC4Gjg58A64DbgcWAzisKzm8Ro2vQZlTCXRrAOWJ4ezewFXFwuRFCe9H1S4g1AAFs12zgilgJLobhc1uk+mplZd3p5JnMUMFfiBmBz4HnAXQ36rQJ+DBxWKQ7duhvYD7gUnjqTGQDeSnE29ALg6gnGNjOzHullkTkaOCiCOwAkvgvMq3eKYEziduB6iT8CN0bwqWZBJa4E/gyYI/HqCN4IfAU4R+La1O1dFDccrAV+CNwBPNi7XWtvaGj8ro3Fi6dy5PHxejnuVO9DbtP1nmwM5s7trv/QUPHc3188l3cp9fdveMdS2bc+H2V7va1+x1kjjWKtXj2x7UrlvjTLpexbtpXLy5fDiSfCSSfBs54FixY1HmPxYhgdXf9nRLk8Otq8/4oVsM02sG5d8bznnutvs2IFPP/5sHZtsbxoESxbBvfeW/Qv7xy87z7YZ5/1x6jvcy49vbtsY9Sru8vMzJ5OpuvuMjMzs6e4yJiZWTYuMmZmlo2LjJmZZeMiY2Zm2bjImJlZNi4yZmaWjYuMmZll4yJjZmbZuMiYmVk2LjJmZpaNi4yZmWXjImNmZtm4yJiZWTYuMmZmlo2LjJmZZeMiY2Zm2bjImJlZNi4yZmaWjYuMmZll4yJjZmbZuMiYmVk2LjJmZpaNi4yZmWXjImNmZtm4yJiZWTYuMmZmlo2LjJmZZeMiY2Zm2bjImJlZNi4yZmaWjSJiunOYVpLWAPdMcPPtgd/3MJ0cNoYcwXn2mvPsrY0hz6nOsT8i5rXr9LQvMpMhaSwiBqc7j1Y2hhzBefaa8+ytjSHPmZqjL5eZmVk2LjJmZpaNi8zkLJ3uBDqwMeQIzrPXnGdvbQx5zsgc/TcZMzPLxmcyZmaWjYuMmZll4yJjZmbZuMh0SdJmks6UdLWk6yS9bArH3krSuZKuknSjpBdLepWk5ZKul3Rqqxy76dujfI+T9LikgZmaZxrrUkn/Kel9MzFPSXMknZfi/VjS4TMlz5TbhyWtlrSg2/Em23eSeb5Z0qiklZI+M1PzTO3bSLpL0rKZkGdXIsKPLh7AUcBZ6fXrgMunePyd0/NC4DTgNmB+alsO7Ncsx2769iDPAeAHaZyBmZgn8EzgemBepW0m5jkEfCO9HgS+O1PyBPqB9wOfBRbknMNGfSeZ53xgc2Az4FfAtjMxz9R+BnAKsCwtT2ue3Tx8JtO9/YDvSRoEPgq8eCoHj4hV6eXzgAeAJ4DfSfocsAXwykY5Stq2076TzVGSgNOB44AnKT68My5PYP8U/xxJ10g6ZobmeQOwjaSvAZ8B/n6m5BkR90TEV4HHALoZr0d9J5RnavtVRDxG8cvGI8C6mZinpKEU57JK12nNsxuzcwTdxAl4C3AXcATwX1OegPSXwF4UZzNHUfyW8xXgXopLoI1yFDCnw76TtRC4KiJ+UdSbrsaeyjyfBYxFxLGS5gE3AQ/P0Dy3Bi6l+OXikC7Gnso86XK8XvSdXLLS1sD/AT4YEY+lX5BmTJ6StqL4peIwis/8U6tmUp4t5Tg92pQfwLuBf0mv9wEumeLx3wv8G7BFWr6L4nKUKH4IvbJZjt30nWSOl1Ccfo8Cayl+E48ZmOfewGXp9Vzg9hk6n+8DTk2vnw/cOtPyBEYYvwyVJbdGfSeZ5/bA5cBQu8/3dOUJvB74McVn6VbgPuDkmZJnR/uSI+im/KC4hnsuxQ/RK4FdpnDsQYpT3PIH+OUU11h/BFwDnNgqx2769jDn0XQgz8g8Kf6udR1wNXDQTMwTeC7wn2kuVwBvnSl5UhS9MeC3wC+Bf82VW6O+k8zz4vR6ND2Onol5VtYNM/43mWnNs5uH/8W/mZll4z/8m5lZNi4yZmaWjYuMmZll4yJjZmbZuMiYmVk2LjJmU0zSMknDE9z2b1R8b90KSc9Ibc9P38F1taQPTyDmLulflZv1nIuMWRsqvpD0f1WWT5F0QMbxhtMXF14r6SxJs1P7M4BjgX0j4rUR8XDa5APAORExFBFnTGDI1wAH9iZ7s/W5yJi1dz/QV1l+FnC/pD0kXZnOIL6TvqIESWOSTkrrzk9tAyq+7XkUeHOzgSTNAb4EHB4Rr6H4h3RHpNXPBn4bEU/WNnsexT/cK2PMknR2Gn+FpN1S+16pYF4r6bTUtgQ4HliQzoZOnuAcmTXk7y4za+8BYK6kSyi+oHIuReG5CHh3RNwp6SSK72z7IsXXldweESdKulzSAMU36J4cEcuVvq69iRcBd0TE79PyvwMHSXo5xdnGrqlQrQH+FjgL2BXYS9Ja4PMU3667J/DqiHi0EvtM4MiIuDcVxX0jYpGKr5QfiIiRiU+RWWMuMmbt3Q/sQvFt0sdQnMk8ADw7Iu5Mfa4DDi03iIjz0vMbACTtQfEdbu3MpvjqoKdCAY9GxAmpWC2LiOHK+uFUtJZFxGjZmM5ULpd0B/CpiHgAeClwbvrS0j7gOR3kYzYpvlxm1t79FF9MehKwO/CMdIbwB0k7pz77A7e0iHEfsKOk7Sj+BtLMz4E901exAxxJ8b1qXYmIshj9GvhQal4FHB0RwxGxZ0RclNofArbrdgyzTvhMxqy9+4GXAz+kuAx1bGr/CPBNSQ9T/AAfaRHjH4HzKP528qNmnSLij5KOB34g6U/A1RFxSTfJprOmL1H8nyQBfDCt+hhwgaRHgEcpCs79wBXAIknXACsi4oRuxjNrxV+QaWZm2fhymZmZZeMiY2Zm2bjImJlZNi4yZmaWjYuMmZll4yJjZmbZuMiYmVk2/x85RH5LUuXNggAAAABJRU5ErkJggg==\n"
     },
     "metadata": {
      "needs_background": "light"
     }
    }
   ],
   "source": [
    "# 图1-2：美国总统就职演说词汇分布图：可以用来研究随时间推移语言使用上的变化\n",
    "text4.dispersion_plot(['citizens','democracy','freedom','duties','America'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stderr",
     "text": [
      "Building ngram index...\n",
      "laid by her , and said unto Cain , Where art thou , and said , Go to ,\n",
      "I will not do it for ten ' s sons ; we dreamed each man according to\n",
      "their generatio the firstborn said unto Laban , Because I said , Nay ,\n",
      "but Sarah shall her name be . , duke Elah , duke Shobal , and Akan .\n",
      "and looked upon my affliction . Bashemath Ishmael ' s blood , but Isra\n",
      "for as a prince hast thou found of all the cattle in the valley , and\n",
      "the wo The\n"
     ]
    },
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "\"laid by her , and said unto Cain , Where art thou , and said , Go to ,\\nI will not do it for ten ' s sons ; we dreamed each man according to\\ntheir generatio the firstborn said unto Laban , Because I said , Nay ,\\nbut Sarah shall her name be . , duke Elah , duke Shobal , and Akan .\\nand looked upon my affliction . Bashemath Ishmael ' s blood , but Isra\\nfor as a prince hast thou found of all the cattle in the valley , and\\nthe wo The\""
      ]
     },
     "metadata": {},
     "execution_count": 8
    }
   ],
   "source": [
    "# nltk 3 已经放弃这个功能\n",
    "text3.generate()"
   ]
  },
  {
   "source": [
    "### 1.1.4 计数词汇"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "text3 中的单词个数： 44764\ntext3 中的单词数： 2789\ntext3 中的单词排序： ['!', \"'\", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)', 'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech']\n"
     ]
    }
   ],
   "source": [
    "print(\"text3 中的单词个数：\", len(text3))\n",
    "print(\"text3 中的单词数：\",len(set(text3)))\n",
    "print(\"text3 中的单词排序：\", sorted(set(text3))[0:20])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "text3 中每个单词被使用的次数的平均数： 16.050197203298673\ntext3 中每个单词被使用的次数的平均数： 16.050197203298673\ntext5 中每个单词被使用的次数的平均数： 7.420046158918563\n"
     ]
    }
   ],
   "source": [
    "print(\"text3 中每个单词被使用的次数的平均数：\",len(text3)/len(set(text3)))\n",
    "print(\"text3 中每个单词被使用的次数的平均数：\", lexical_diversity(text3))\n",
    "print(\"text5 中每个单词被使用的次数的平均数：\", lexical_diversity(text5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "text3 中单词 'smote' 的使用次数： 5\ntext3 中单词 'smote' 的出现比例： 0.01116968992940756\n"
     ]
    }
   ],
   "source": [
    "print(\"text3 中单词 'smote' 的使用次数：\", text3.count(\"smote\"))\n",
    "print(\"text3 中单词 'smote' 的出现比例：\", percentage(text3.count(\"smote\"),len(text3)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "text4 中单词 'a' 的出现比例： 1.4643016433938312\ntext4 中单词 'a' 的出现比例： 1.4643016433938312\n"
     ]
    }
   ],
   "source": [
    "print(\"text4 中单词 'a' 的出现比例：\", 100*text4.count('a')/len(text4))\n",
    "print(\"text4 中单词 'a' 的出现比例：\", percentage(text4.count('a'), len(text4)))"
   ]
  }
 ]
}